09. Lab: Batch Normalization

Batch Normalization

In the TensorFlow for Deep Learning lesson, Vincent Vanhoucke explained the importance of normalizing our inputs and how this helps the network learn. Batch normalization is an additional way to optimize network training.

Batch normalization is based on the idea that, instead of just normalizing the inputs to the network, we normalize the inputs to layers within the network. It's called "batch" normalization because during training, we normalize each layer's inputs by using the mean and variance of the values in the current mini-batch.

A network is a series of layers, where the output of one layer becomes the input to another. That means we can think of any layer in a neural network as the first layer of a smaller network.

For example, imagine a 3-layer network. Instead of just thinking of it as a single network with inputs, layers, and outputs, think of the output of layer 1 as the input to a two-layer network. This two-layer network would consist of layers 2 and 3 in our original network.

Likewise, the output of layer 2 can be thought of as the input to a single-layer network, consisting only of layer 3.

When you think of it like that - as a series of neural networks feeding into each other - then it's easy to imagine how normalizing the inputs to each layer would help. It's just like normalizing the inputs to any other neural network, but you're doing it at every layer (sub-network).
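
To make the "batch" part concrete, here is a minimal NumPy sketch of the normalization step itself. It is only illustrative: the variable names, the epsilon term, and the learnable gamma/beta scale-and-shift parameters are assumptions of this sketch, and frameworks such as TensorFlow handle these details for you.

import numpy as np

def batch_norm(batch_inputs, gamma=1.0, beta=0.0, epsilon=1e-5):
    # Mean and variance are computed over the current mini-batch
    mean = np.mean(batch_inputs, axis=0)
    variance = np.var(batch_inputs, axis=0)
    # Normalize to roughly zero mean and unit variance, then scale and shift
    normalized = (batch_inputs - mean) / np.sqrt(variance + epsilon)
    return gamma * normalized + beta

# Example: a mini-batch of 4 samples with 3 features each
mini_batch = np.random.randn(4, 3) * 10.0 + 5.0
normalized_batch = batch_norm(mini_batch)
print(normalized_batch.mean(axis=0))  # approximately 0 for each feature
print(normalized_batch.std(axis=0))   # approximately 1 for each feature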

Coding Batch Normalization

In tf.contrib.keras, batch normalization can be added by applying the BatchNormalization layer to the output of a previous layer:

from tensorflow.contrib.keras.python.keras import layers

# `input` is the output tensor of the preceding layer
output = layers.BatchNormalization()(input)

In your notebook, the separable_conv2d_batchnorm() function adds a batch normalization layer after the separable convolution layer; a sketch of how such a function might be put together follows the list below. Introducing the batch normalization layer presents us with quite a few advantages.

Some of them are:

  • Networks train faster – Each training iteration will actually be slower because of the extra calculations during the forward pass. However, it should converge much more quickly, so training should be faster overall.
  • Allows higher learning rates – Gradient descent usually requires small learning rates for the network to converge. And as networks get deeper, their gradients get smaller during backpropagation, so they require even more training iterations. Using batch normalization allows us to use much higher learning rates, which further increases the speed at which networks train.
  • Simplifies the creation of deeper networks – Because of the above reasons, it is easier to build and faster to train deeper neural networks when using batch normalization.
  • Provides a bit of regularization – Batch normalization adds a little noise to your network. In some cases, such as in Inception modules, batch normalization has been shown to work as well as dropout.
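
For reference, here is a rough sketch of how a function like separable_conv2d_batchnorm() could be composed from the tf.contrib.keras layers shown above. It is a sketch under assumed defaults, not the notebook's actual implementation: the kernel size, padding, and ReLU activation are assumptions.

from tensorflow.contrib.keras.python.keras import layers

def separable_conv2d_batchnorm(input_layer, filters, strides=1):
    # Depthwise-separable convolution; kernel size, padding, and activation
    # are illustrative assumptions
    output_layer = layers.SeparableConv2D(filters=filters, kernel_size=3,
                                          strides=strides, padding='same',
                                          activation='relu')(input_layer)
    # Normalize the convolution's outputs with mini-batch statistics
    output_layer = layers.BatchNormalization()(output_layer)
    return output_layer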